Parallelizing Software-Implemented Error Detection
Abstract
Because of economic pressure, commodity hardware with insufficient error detection is increasingly used in critical applications. Moreover, commodity hardware is expected to become less reliable because of continuously decreasing feature sizes. Thus, we expect that software-implemented approaches for dealing with unreliable hardware will be needed. Arithmetic codes are well suited for this purpose because they provide very good error detection capabilities independent of the actual failure modes of the underlying hardware. However, arithmetic codes incur high slowdowns. First, this paper describes our encoding, which uses an expensive AN-code. Second, we show how we harness the power of modern multicore CPUs to parallelize this expensive but flexible and powerful software-implemented fault detection technique. Our measurements show that under continuous probabilistic error injection, AN-encoding reduces the share of runs with incorrect output from 15.9% for the unencoded execution to 0.5% in the encoded case. Our parallelization reduces the observed slowdowns by an order of magnitude.
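To make the idea behind an AN-code concrete, the following is a minimal sketch and not the encoding described in the paper. It assumes the textbook form of the code: a value x is stored as A * x, and any word that is not a multiple of A is treated as corrupted. The constant `A` and the helper names (`encode`, `is_valid`, `decode`, `add_encoded`, `flip_bit`) are hypothetical and chosen only for illustration.

```python
# Illustrative AN-code sketch. A is an arbitrary odd constant picked for
# this example; it is an assumption, not a value taken from the paper.
A = 58659


def encode(x: int) -> int:
    """Encode a plain integer as a multiple of A."""
    return x * A


def is_valid(xc: int) -> bool:
    """A code word is valid only if it is divisible by A."""
    return xc % A == 0


def decode(xc: int) -> int:
    """Check the code word and recover the plain value."""
    if not is_valid(xc):
        raise ValueError("AN-code check failed: possible hardware error")
    return xc // A


def add_encoded(ac: int, bc: int) -> int:
    """Add two encoded values: A*a + A*b == A*(a + b), so the sum stays encoded."""
    return ac + bc


def flip_bit(value: int, bit: int) -> int:
    """Simulate a single-bit hardware fault by flipping one bit."""
    return value ^ (1 << bit)


if __name__ == "__main__":
    total = add_encoded(encode(20), encode(22))
    print(decode(total))

    # A single bit flip changes the word by +/- 2**k. Because A is odd and
    # greater than 1, 2**k is never a multiple of A, so the check fails.
    corrupted = flip_bit(total, 3)
    print(is_valid(corrupted))
```

In this sketch the check is an explicit modulo operation on every decode, which hints at where the slowdown mentioned in the abstract comes from: each protected operation works on larger values and needs additional checks.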
Similar resources
Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment by Saurabh Bagchi
An Evaluation of the Error Detection Mechanisms in MARS Using Software-Implemented Fault Injection
The concept of fail-silent nodes greatly simplifies the design and safety proof of highly dependable fault-tolerant computer systems. The MAintainable Real-Time System (MARS) is a computer system where the hardware, operating system, and application-level error detection mechanisms are designed to ensure the fail silence of nodes with a high probability. The goal of this paper is twofold. First, the e...
The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems
The FTMPS-project provides a solution to the need for fault-tolerance in large systems. A complete fault-tolerance approach is developed and being implemented. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for t...
The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems
The FTMPS-project provides a solution to the need for fault-tolerance in large systems. A complete fault-tolerance approach is developed and being implemented. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for the...
KUDA: GPU Accelerated Split Race Checker
We propose a novel approach for runtime verification on computers with a large number of computation cores, without any hardware extension to the mainstream PC environment. The goal of the approach is to use all hardware resources to decouple the computational overhead of traditional race checkers by parallelizing the runtime verification. We distinguish between two kinds of computational o...